In [1]:
!conda install -c bioconda perl-xml-parser --yes
!conda install -c anaconda beautifulsoup4 --yes
!conda install -c anaconda lxml --yes
from bs4 import BeautifulSoup #Needed for web scraping
import requests #Needed for handling requests
import pandas as pd #Using to create dataframes and manipulate data
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge folium=0.5.0 --yes
import folium #Geographic visualization
from geopy.geocoders import Nominatim #Handling coordinates for requesting data through the Foursquare API
from pandas.io.json import json_normalize #Normalizing JSON messages for data analysis
import numpy as np
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - perl-xml-parser


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2020.4.5.1         |           py36_0         155 KB
    ------------------------------------------------------------
                                           Total:         155 KB

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi            conda-forge::certifi-2020.4.5.1-py36h~ --> pkgs/main::certifi-2020.4.5.1-py36_0



Downloading and Extracting Packages
certifi-2020.4.5.1   | 155 KB    | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

In the web research phase, the author scrapes information from relevant websites to determine the ideal location for opening a Hungarian fine dining restaurant.

In [3]:
hungarian_americans = requests.get('https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_income').text
hungarian_americans2 = requests.get('https://en.wikipedia.org/wiki/Hungarian_Americans').text
hungarian_americans3 = requests.get('https://simple.wikipedia.org/wiki/List_of_U.S._states%27_largest_cities').text
hungarian_americans4 = requests.get('https://en.wikipedia.org/wiki/List_of_United_States_counties_by_per_capita_income').text
#URLs for web research and analysis                               
soup  = BeautifulSoup(hungarian_americans,'lxml')
soup2 = BeautifulSoup(hungarian_americans2, 'lxml')
soup3 = BeautifulSoup(hungarian_americans3, 'lxml')
soup4 = BeautifulSoup(hungarian_americans4, 'lxml')
#Pairing with BeautifulSoup to scrape tables
tables = soup.find('table', class_ = "wikitable")
tables2 = soup2.find('table', class_ = "wikitable")
tables3 = soup3.find('table', class_ = "wikitable")
tables4 = soup4.find('table', class_ = "wikitable")
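As an aside, pandas can parse these Wikipedia tables directly with `read_html`, which would replace the manual `<tr>`/`<td>` loops used in the cells below. This is a sketch of that alternative, not the approach the author uses (the helper name `wikitable_to_df` is mine):

```python
import io
import pandas as pd

def wikitable_to_df(html_text):
    """Parse the first table with class 'wikitable' in an HTML document into a DataFrame."""
    # read_html returns a list of DataFrames, one per matching table
    tables = pd.read_html(io.StringIO(html_text), attrs={"class": "wikitable"})
    return tables[0]
```

This requires lxml (already installed above) and handles header rows automatically, so the `drop([0])` cleanup below would not be needed.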

The author found a table containing the size of the Hungarian-American population in different states across the US. The author's hypothesis rests on the assumption that a Hungarian restaurant should open in a major city where Hungarian-American and Eastern European heritage is significant and economic power is strong.

In [4]:
row= []
for tr in tables2.find_all('tr'):
    data = tr.find_all('td')
    row.append([i.text.strip() for i in data]) 
labels = tables2.find_all('th')
labels = [c.text for c in labels]
labels = [i.strip() for i in labels] #Scraping table contents from Wikipedia website 
df2 = pd.DataFrame( data  = row , columns = labels) #Putting results into a Pandas Dataframe
new_data = df2.drop([0]) #Dropping the first line of the dataframe
new_data = new_data.rename(columns={'Population[9]':'Population'}) #Renaming one of the columns for better readability
new_data
Out[4]:
State Population
1 Ohio 203,417
2 New York 157,863
3 California 133,988
4 Pennsylvania 132,184
5 New Jersey 115,615
6 Michigan 98,036
7 Florida 96,885

There is now a list of seven states with a significant Hungarian-American population. The author intends to pair it with the median income of each of those states, which first needs to be scraped from the web.

In [5]:
row= []
for tr in tables.find_all('tr'):
    data = tr.find_all('td')
    row.append([i.text.strip() for i in data]) 
labels = tables.find_all('th')
labels = [c.text for c in labels]
labels = [i.strip() for i in labels] 
df = pd.DataFrame( data  = row , columns = labels)
df= df.drop([0])
df.rename(columns={'State or territory':'State', '2018':'Median Income'}, 
                 inplace=True) #Renaming columns in order to ease merging with other data.
df.head(1) #Sampling the dataframe
Out[5]:
Rank State Median Income 2017 2016 2015 2014[note 2]
1 1 Washington, D.C. $85,203 $82,372 $75,506 $75,628 $71,648

To identify the best candidate locations for further analysis, the author ranks the "Hungarian-American states" by adding median income figures.

In [6]:
merged_data = pd.merge(new_data, df[["State", "Median Income"]], on="State", how="left")
merged_data = merged_data.sort_values(by=['Median Income'], ascending=False)
merged_data
Out[6]:
State Population Median Income
4 New Jersey 115,615 $81,740
2 California 133,988 $75,277
1 New York 157,863 $67,844
3 Pennsylvania 132,184 $60,905
5 Michigan 98,036 $56,697
0 Ohio 203,417 $56,111
6 Florida 96,885 $55,462
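One caveat: the 'Median Income' values are still strings like '$81,740', so the sort above is lexicographic and only gives the right order by coincidence for these particular values. A minimal sketch of a robust numeric conversion (the helper name `dollars_to_int` is mine, not from the notebook):

```python
import pandas as pd

def dollars_to_int(value):
    """Convert a currency string such as '$81,740' to an integer."""
    return int(value.replace('$', '').replace(',', ''))

# Toy frame mirroring merged_data's 'Median Income' column
income = pd.DataFrame({'State': ['Ohio', 'New Jersey'],
                       'Median Income': ['$56,111', '$81,740']})
income['Income'] = income['Median Income'].map(dollars_to_int)
ranked = income.sort_values('Income', ascending=False)  # sort on the numeric column
```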

At this point, New Jersey, California and New York appear to be the most favorable states for a new Hungarian restaurant in the US.

To narrow the search, it is important to pinpoint some big cities and assess the size of their populations, which will help the author gauge the size of the potential markets. To do this, the author pulls the populations of the biggest cities in the "Hungarian-American states".

In [8]:
row= []
for tr in tables3.find_all('tr'):
    data = tr.find_all('td')
    row.append([i.text.strip() for i in data]) 
labels = tables3.find_all('th')
labels = [c.text for c in labels]
labels = [i.strip() for i in labels] 
df3 = pd.DataFrame( data  = row , columns = labels)
df3 = df3.rename(columns={'State,Federal District,or Territory':'State'})
df3 = df3.drop([0])
df3.head(1)
Out[8]:
State State,Federal District,or Territory Population[1] Most populous City population 2nd most populous 3rd most populous 4th most populous 5th most populous Capital (if not otherwise listed)
1 Alabama 4,817,786 Birmingham 212,237 Montgomery Huntsville Mobile Tuscaloosa [2]

Merging this table with the initial one adds each state's biggest city and its population.

In [10]:
merged_data2 = pd.merge(merged_data, df3[["State", 'Most populous', "City population"]], on="State", how="left")
merged_data2
Out[10]:
State Population Median Income Most populous City population
0 New Jersey 115,615 $81,740 Newark 277,140
1 California 133,988 $75,277 Los Angeles 4,792,621
2 New York 157,863 $67,844 New York City 8,175,133
3 Pennsylvania 132,184 $60,905 Philadelphia 1,526,006
4 Michigan 98,036 $56,697 Detroit 713,777
5 Ohio 203,417 $56,111 Columbus 879,170
6 Florida 96,885 $55,462 Jacksonville 880,619

Even though New Jersey has the highest median income of all seven states, its biggest city, Newark, has a population of only about 277,000. Since Hungarian cuisine is relatively unknown in the US, such a small market disqualifies New Jersey as a potential location.

To further sharpen the picture, the author pulls city-level median income, an additional indicator of economic strength and spending power.

In [11]:
row= []
for tr in tables4.find_all('tr'):
    data = tr.find_all('td')
    row.append([i.text.strip() for i in data]) 
labels = tables4.find_all('th')
labels = [c.text for c in labels]
labels = [i.strip() for i in labels] 
df4 = pd.DataFrame( data  = row , columns = labels)
df4 = df4.drop([0])
df4.head(1)
Out[11]:
Rank County or county-equivalent State, federal district or territory Per capitaincome Medianhouseholdincome Medianfamilyincome Population Number ofhouseholds
1 1 New York County New York $62,498 $69,659 $84,627 1,605,272 736,192

Median Household Income will be used to choose the target city. Let's see what the top three potential locations are.

In [12]:
df4 = df4.rename(columns={'State, federal district or territory':'State'})
values = ["New York County", "Los Angeles", "Philadelphia"]
df4 = df4.loc[df4['County or county-equivalent'].isin(values)]


merged_data3 = pd.merge(merged_data2, df4[["State", 'Medianhouseholdincome']], on="State", how="left")


merged_data3 = merged_data3.rename(columns={'Population': 'Hungarian Ancestry Population','2018':'Median Household Income / State', 'Most populous': 'Biggest City', 'Medianhouseholdincome': 'Median Household Income / City'})
merged_data3 = merged_data3.replace(np.nan, '', regex=True)
merged_data3 = merged_data3.sort_values(by=['Median Household Income / City'], ascending=False)
merged_data3.head(3)
Out[12]:
State Hungarian Ancestry Population Median Income Biggest City City population Median Household Income / City
2 New York 157,863 $67,844 New York City 8,175,133 $69,659
1 California 133,988 $75,277 Los Angeles 4,792,621 $55,909
3 Pennsylvania 132,184 $60,905 Philadelphia 1,526,006 $37,192

As the list above shows, New York leads the three finalists in city population, median household income and Hungarian-American population, so New York is chosen as the place to open the new Hungarian restaurant.

In the second phase of the analysis, the author will be working with some Foursquare data to pinpoint possible locations for the restaurant in NYC.

In [26]:
CLIENT_ID = 'Y2NPMAP2FXXT22HUQOAKKNDAMUQHUOZ2PN3DOSQVGYCKBYE5' 
CLIENT_SECRET = '54NFIC2OFJ3JAUFAGAEUP1OP3A2AMVHUONV5Q1UWJI0EXTGS' 
VERSION = '20200425'
LIMIT = 1000

address = 'Manhattan, New York, United States'
#User specific information to ask for Foursquare data.

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
#Translating location information to latitudes and longitudes

search_query = 'Hungarian' #Looking for all Hungarian venues in NYC.
radius = 60000 #Within a radius of 60km
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
results = requests.get(url).json()
In [14]:
results = requests.get(url).json() #Request JSON from Foursquare API
venues = results['response']['venues'] #Get venues data from JSON


dataframe = json_normalize(venues)
dataframe.head(3) #Normalize JSON results into a Pandas dataframe and sample data
/home/jupyterlab/conda/envs/python/lib/python3.6/site-packages/ipykernel_launcher.py:5: FutureWarning: pandas.io.json.json_normalize is deprecated, use pandas.json_normalize instead
  """
Out[14]:
id name categories referralId hasPerk location.address location.crossStreet location.lat location.lng location.labeledLatLngs location.distance location.postalCode location.cc location.city location.state location.country location.formattedAddress
0 45caf0baf964a52080421fe3 Hungarian House [{'id': '4bf58dd8d48988d1f2931735', 'name': 'P... v-1587885190 False 213 E 82nd St Betw. 2nd & 3rd Ave. 40.776127 -73.954962 [{'label': 'display', 'lat': 40.77612688560968... 1558 10028 US New York NY United States [213 E 82nd St (Betw. 2nd & 3rd Ave.), New Yor...
1 40f32600f964a520720a1fe3 The Hungarian Pastry Shop [{'id': '4bf58dd8d48988d1e0931735', 'name': 'C... v-1587885190 False 1030 Amsterdam Ave btwn W 110th & W 111th St 40.803562 -73.963475 [{'label': 'display', 'lat': 40.80356193389294... 1580 10025 US New York NY United States [1030 Amsterdam Ave (btwn W 110th & W 111th St...
2 4edd0438f9abeea3dcbc90d0 HHRF (Hungarian Human Rights Foundation) [{'id': '4bf58dd8d48988d124941735', 'name': 'O... v-1587885190 False 120 E 90th St, New York, NY 10128 NaN 40.781747 -73.953751 [{'label': 'display', 'lat': 40.78174663699295... 1018 10128 US New York NY United States [120 E 90th St, New York, NY 10128, New York, ...

To better understand this data, the category type needs to be extracted; it is nested inside the categories column.

In [16]:
# Stripping down dataset to only include category, name and location related data.
filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
dataframe_filtered = dataframe.loc[:, filtered_columns]

#Function to get category, thanks for this Coursera!
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

#Filtering the categories to ease understanding
dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)

#Further cleaning column names
dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]
dataframe_filtered.head(1)
Out[16]:
name categories address crossStreet lat lng labeledLatLngs distance postalCode cc city state country formattedAddress id
0 Hungarian House Performing Arts Venue 213 E 82nd St Betw. 2nd & 3rd Ave. 40.776127 -73.954962 [{'label': 'display', 'lat': 40.77612688560968... 1558 10028 US New York NY United States [213 E 82nd St (Betw. 2nd & 3rd Ave.), New Yor... 45caf0baf964a52080421fe3

The author's intention is to group the Hungarian-related venues in NYC into eatery and non-eatery categories. To support this, some cleaning is needed.

In [17]:
ny_hungarian = dataframe_filtered[['name','categories','lat','lng']]

food_venues = ["Snack Place", "Coffee Shop", 'Bakery', 'Hungarian Restaurant', 'Deli/Bodega','Food', 'Restaurant', 'Eastern European Restaurant']
food_nyc = ny_hungarian.loc[ny_hungarian['categories'].isin(food_venues)]

other_venues = ['Event Space', 'Performing Arts Venue', 'Assisted Living', 'Embassy / Consulate', 'Church', 'Construction & Landscaping', 'Office', 'Grocery Store', 'History Museum', 'Nightlife Spot', 'Building', 'None', 'Soccer Stadium'] #Note the comma between 'Grocery Store' and 'History Museum'; without it the two strings are concatenated and never matched
other_venues = ny_hungarian.loc[ny_hungarian['categories'].isin(other_venues)]
In [18]:
food_nyc.shape #We found 9 Hungarian-related food venues in NYC and the surrounding areas.
Out[18]:
(9, 4)
In [19]:
other_venues.shape #We found 19 Hungarian-related venues in NYC that are not eateries.
Out[19]:
(19, 4)
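Rather than hand-maintaining the two category lists above, the raw category frequencies could be inspected first to decide the split. A minimal sketch (the helper name `category_counts` is mine), assuming a frame with a 'categories' column like `ny_hungarian`:

```python
import pandas as pd

def category_counts(venues):
    """Return venue counts per category, most common first."""
    return venues['categories'].value_counts()
```

Running this on `ny_hungarian` would list every distinct category and how often it occurs, making it easy to spot typos or categories missing from the hand-written lists.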

Let's see where the restaurant's direct competition is located and visualize the results on a map.

In [22]:
venues_map = folium.Map(location=[40.77, -73.90], zoom_start=10)

for lat, lng, label in zip(food_nyc['lat'], food_nyc['lng'], food_nyc['name']):
        folium.features.CircleMarker(
            [lat, lng],
            radius=3,
popup=label,
            fill=True,
            color='red',
            fill_color='red',
            fill_opacity=0.6
        ).add_to(venues_map)
        
geo_json = 'https://data.cityofnewyork.us/api/geospatial/tqmj-j8zm?method=export&format=GeoJSON'

venues_map.choropleth(
    geo_data=geo_json,
    fill_color='black', 
key_on='feature.properties.boro_name',
    fill_opacity=0.0, 
    line_opacity=1,
)
venues_map
Out[22]:

As the map clearly shows, the majority of the restaurant's competition is in the Manhattan borough, while Brooklyn and Queens each have one Hungarian eatery.

Let's see how the map looks when the other Hungarian venues are added.

In [23]:
venues_map = folium.Map(location=[40.77, -73.90], zoom_start=10)

for lat, lng, label in zip(food_nyc['lat'], food_nyc['lng'], food_nyc['name']):
        folium.features.CircleMarker(
            [lat, lng],
            radius=3,
popup=label,
            fill=True,
            color='red',
            fill_color='red',
            fill_opacity=0.6
        ).add_to(venues_map)
        
for lat, lng, label in zip(other_venues['lat'], other_venues['lng'], other_venues['name']):
        folium.features.CircleMarker(
            [lat, lng],
            radius=4,
popup=label,
            fill=True,
            color='blue',
            fill_color='blue',
            fill_opacity=0.6
        ).add_to(venues_map)
        
geo_json = 'https://data.cityofnewyork.us/api/geospatial/tqmj-j8zm?method=export&format=GeoJSON'

venues_map.choropleth(
    geo_data=geo_json,
    fill_color='black', 
key_on='feature.properties.boro_name',
    fill_opacity=0.0, 
    line_opacity=1,
)
venues_map
Out[23]:

It is now visible that the majority of Hungarian restaurants and venues are in Manhattan, which means strong competition for the prospective restaurant.

The author therefore explores the two neighboring boroughs, Queens and Brooklyn, as potential locations for the restaurant.

In [27]:
def get_venues(lat,lng):
    
    #Variables Needed for API call
    radius=1000
    LIMIT=100
    CLIENT_ID = 'Y2NPMAP2FXXT22HUQOAKKNDAMUQHUOZ2PN3DOSQVGYCKBYE5'
    CLIENT_SECRET ='54NFIC2OFJ3JAUFAGAEUP1OP3A2AMVHUONV5Q1UWJI0EXTGS' 
    VERSION = '20200425' 
    #Version and Credentials
    
    #URL to call for data
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
    
    #This is where we call the data.
    results = requests.get(url).json()
    venue_data=results["response"]['groups'][0]['items']
    venue_details=[]
    for row in venue_data:
        try:
            venue_id=row['venue']['id']
            venue_name=row['venue']['name']
            venue_category=row['venue']['categories'][0]['name']
            venue_details.append([venue_id,venue_name,venue_category])
        except KeyError:
            print('Skipping venue with incomplete data')
        
    column_names=['ID','Name','Category']
    df = pd.DataFrame(venue_details,columns=column_names)
    return df
In [28]:
def get_new_york_data():
    url='https://cocl.us/new_york_dataset'
    resp=requests.get(url).json()
    #This is a dataset provided by Coursera, thanks guys!
    features=resp['features']
    
    #Defining the dataframe to load data to
    column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 
    #Instantiate dataframe
    new_york_data = pd.DataFrame(columns=column_names)
    
    for data in features:
        borough = data['properties']['borough'] 
        neighborhood_name = data['properties']['name']
        
        neighborhood_latlon = data['geometry']['coordinates']
        neighborhood_lat = neighborhood_latlon[1]
        neighborhood_lon = neighborhood_latlon[0]
    
        new_york_data = new_york_data.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)
    
    return new_york_data
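As a side note, building the frame row by row with `DataFrame.append` copies the frame on every call; collecting dicts in a list and constructing the frame once is the usual idiom. A sketch of the same flattening logic as `get_new_york_data`, assuming the GeoJSON features carry the fields used above (the helper name `features_to_df` is mine):

```python
import pandas as pd

def features_to_df(features):
    """Flatten GeoJSON neighborhood features into a Borough/Neighborhood/Latitude/Longitude frame."""
    rows = []
    for feature in features:
        lon, lat = feature['geometry']['coordinates']  # GeoJSON coordinate order is [lon, lat]
        rows.append({'Borough': feature['properties']['borough'],
                     'Neighborhood': feature['properties']['name'],
                     'Latitude': lat,
                     'Longitude': lon})
    # Build the DataFrame once from the collected rows
    return pd.DataFrame(rows, columns=['Borough', 'Neighborhood', 'Latitude', 'Longitude'])
```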

We need to filter results for only Brooklyn and Queens.

In [35]:
new_york_data=get_new_york_data()

new_york_boros = ["Queens", "Brooklyn"]
new_york_data = new_york_data.loc[new_york_data['Borough'].isin(new_york_boros)]#Filtering results for Brooklyn and Queens only.
new_york_data.head(2) #Sampling the results
Out[35]:
Borough Neighborhood Latitude Longitude
46 Brooklyn Bay Ridge 40.625801 -74.030621
47 Brooklyn Bensonhurst 40.611009 -73.995180

Now that the author has extracted all the needed neighborhood coordinates, let's pair them with all restaurant-type venues in the different neighborhoods. This will be used to find a location that has a low density of restaurants yet is still in close proximity to Manhattan.

In [44]:
# prepare the list of neighborhoods that contain restaurants
column_names=['Borough', 'Neighborhood', 'ID','Name']
nyc_restaurants=pd.DataFrame(columns=column_names)
count=1
for row in new_york_data.values.tolist():
    Borough, Neighborhood, Latitude, Longitude=row
    try:
        venues = get_venues(Latitude,Longitude)
        restaurants=venues[venues['Category']=='Restaurant'] 
    except: 
        continue #Skip the neighborhood if the API call fails
    for restaurant_detail in restaurants.values.tolist():
        id, name , category=restaurant_detail
        nyc_restaurants = nyc_restaurants.append({'Borough': Borough,
                                                'Neighborhood': Neighborhood, 
                                                'ID': id,
                                                'Name' : name
                                               }, ignore_index=True)
    count+=1

Let's look at a sample of this data.

In [45]:
nyc_restaurants.head(2)
Out[45]:
Borough Neighborhood ID Name
0 Brooklyn Bay Ridge 5619c289498ebe31e3cf7d18 Salud BKLYN
1 Brooklyn Greenpoint 55d7c6bb498efa168c9483a9 Archestratus Books & Foods

Let's see how many restaurants Brooklyn and Queens have.

In [56]:
nyc_restaurants['count'] = nyc_restaurants.groupby('Borough')['Borough'].transform('count')
new = nyc_restaurants[['Borough', 'count']].copy()
new = new.drop_duplicates('Borough')
new
Out[56]:
Borough count
0 Brooklyn 58
47 Queens 39

According to Foursquare, there are 58 venues labeled as restaurants in Brooklyn and 39 in Queens.

Let's see how those restaurants are distributed on a map.

In [58]:
nyc_restaurants['count'] = nyc_restaurants.groupby('Neighborhood')['Neighborhood'].transform('count')
new2 = nyc_restaurants[['Borough', 'Neighborhood', 'count']].copy()
new2 = new2[['Borough', 'Neighborhood', 'count']]
new2.head(2)
Out[58]:
Borough Neighborhood count
0 Brooklyn Bay Ridge 1
1 Brooklyn Greenpoint 2
In [61]:
gjson = r'https://raw.githubusercontent.com/veltman/snd3/master/data/nyc-neighborhoods.geo.json'
nyc_map = folium.Map(location=[40.77, -73.90], zoom_start=10)

#Generate NYC Map
nyc_map.choropleth(
    geo_data=gjson,
    data=new2,
    columns=['Neighborhood', 'count'],
    key_on='feature.properties.name',
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=1,
    legend_name='Restaurant Count in NYC'
)
nyc_map
Out[61]:

It seems that in both Brooklyn and Queens, the areas in close proximity to Manhattan are not densely populated with restaurants, and could therefore be ideal places to open a new venue.
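To turn that visual impression into a concrete shortlist, the low-count neighborhoods could be pulled out of the counts frame directly. A minimal sketch (the helper name `low_density` is mine), assuming a Borough/Neighborhood/count frame like `new2` above:

```python
import pandas as pd

def low_density(neigh_counts, max_count=1):
    """Return neighborhoods with at most max_count restaurants."""
    dedup = neigh_counts.drop_duplicates('Neighborhood')  # one row per neighborhood
    return dedup[dedup['count'] <= max_count]
```

Applied to `new2`, this would list the Brooklyn and Queens neighborhoods with the fewest competing restaurants, which could then be cross-checked against distance to Manhattan.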
